Text Mining Using HMM and PPM
Authors
Abstract
Text mining involves the use of statistical and machine learning techniques to learn structural elements of text in order to search for useful information in previously unseen text. The need for these techniques has emerged from the rapidly growing volume of information. Token identification is an important component of any text mining tool, and doing it well improves diverse applications that search for patterns in textual data. Several different identification methods have been reported in the literature. HMMs and PPM models have been used successfully in language processing tasks, and each has been applied separately to learning-based token identification. Most existing systems are domain- and language-dependent. In this thesis, we implement a system that bridges the two well-known methods through words that are new to the identification model. The system is fully domain- and language-independent: no code changes are necessary when applying it to other domains or languages, and the only requirement is an annotated corpus. The system has been tested on two corpora and achieved an overall F-measure of 76.59% for TCC and 69.02% for BIB. This is not as good as would be expected from a system that includes language-dependent components, but our system is more general. Date identification gives the best results, with 73% of tokens correctly identified for TCC and 92% for BIB. The system also performs reasonably well on people's names, with 68% of tokens correctly identified for TCC and 76% for BIB.
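The abstract reports F-measure scores without stating how they are computed. As a minimal sketch, assuming the standard definition (the weighted harmonic mean of precision and recall, balanced when beta = 1), the score could be derived from token counts as follows. The function name, its parameters, and the counts in the example are hypothetical and are not taken from the thesis.

```python
def f_measure(correct, predicted, actual, beta=1.0):
    """Return the F-measure for a token identification run.

    correct   -- tokens the system labeled that match the annotated corpus
    predicted -- all tokens the system labeled
    actual    -- all tokens labeled in the annotated corpus
    beta      -- weight of recall relative to precision (1.0 = balanced)

    Assumes the standard definition; the thesis may use a variant.
    """
    if predicted == 0 or actual == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / actual
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 3 correct out of 4 predicted, 6 tokens in the gold data.
score = f_measure(correct=3, predicted=4, actual=6)
print(f"{score:.2%}")
```

With these illustrative counts, precision is 0.75 and recall is 0.5, so the balanced score sits between the two and closer to the lower value, which is the usual behavior of a harmonic mean.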
Similar resources
Alert correlation and prediction using data mining and HMM
Intrusion Detection Systems (IDSs) are security tools widely used in computer networks. While they seem to be promising technologies, they have some serious drawbacks: when used in large, high-traffic networks, IDSs generate high volumes of low-level alerts that are hard to manage. Accordingly, there emerged a recent track of security research, focused on alert correlation, which ext...
Speech enhancement based on hidden Markov model using sparse code shrinkage
This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on a Laplace-Gaussian (for clean speech and noise respectively) combination in the HMM ...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
Mining Adverse Drug Reactions from online healthcare forums using Hidden Markov Model
BACKGROUND: Adverse Drug Reactions are one of the leading causes of injury or death among patients undergoing medical treatments. Not all Adverse Drug Reactions are identified before a drug is made available in the market. Current post-marketing drug surveillance methods, which are based purely on voluntary spontaneous reports, are unable to provide the early indications necessary to prevent the...
Topic Modeling and Classification of Cyberspace Papers Using Text Mining
The global cyberspace networks provide individuals with platforms to interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...